# Narrative Similarity Replication Package

This repository includes the public-facing readme for the replication package for "Quantifying Narrative Similarity Across Languages," published in Sociological Methods & Research. We are not able to share all scripts due to copyright. 

Scripts are numbered (01..., 02...) in the order they should be run. Scripts with the same number can be run in any order. In some cases replicators will not be able to replicate the full contents of a script. These are cases where doing so would require access to the raw text data, which we cannot share due to copyright, and cases where the code interacts with a GPT model through the OpenAI API, so any replication will not be exact. For these cases we provide intermediary data files so replicators can compare the output of the other parts of the code with the saved output.

A few scripts also require replicators to run the script as an array job. For these cases we provide example slurm scripts that replicators can tailor to any HPC they have access to. 

## GPT4o Pipeline

1) code\_public/01\_embed\_sbert\_v2.py: This code embeds the summaries with the SBERT bi-encoders. It includes both roberta and mpnet.

2) code\_public/02\_cutoff\_sbert\_embeddings.R: This script computes the between-article cosine measure on the bi-encoder SBERT embeddings and tunes a cutoff for these embeddings. Pairs below the cutoff are passed through the SBERT cross encoder.

3) code\_public/03\_cross\_encode\_sts.py
This script runs the bioweapons pairs through the SBERT cross encoder.

4) code\_public/04\_gpt\_annotator\_bioweapons\_lists.R
This code enumerates the claims and subjects of each bioweapons doc using gpt4o.

5) code\_public/05\_prep\_gpt4o.R
This code prepares the cross encoder scores for gpt4o annotation. It also tunes the cross encoder cutoff. 

6) code\_public/06\_gpt\_fine\_tune.R 
This code fine-tunes GPT with the purposive sampling data.

7) code\_public/07\_gpt\_annotator\_bioweapons.R
This code runs the GPT4o zero-shot annotator on the candidate pairs.

8) code\_public/07\_gpt\_annotator\_finetune\_bioweapons.R
This code runs the fine-tuned GPT4o annotator on the candidate pairs.
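
The cutoff step in script 02 can be illustrated with a minimal sketch. This is not the package's code (that step is written in R). It assumes the cosine measure is a distance (1 minus cosine similarity), so that "below cutoff" means "more similar". The function names, toy vectors, and cutoff value are all hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def candidate_pairs(embeddings, cutoff):
    """Keep document pairs whose cosine distance falls below the cutoff.

    `embeddings` maps a document id to its embedding vector.
    The kept pairs are the ones a later stage would re-score.
    """
    kept = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        distance = cosine_distance(vec_a, vec_b)
        if distance < cutoff:
            kept.append((id_a, id_b, distance))
    return kept


# Toy 2-d vectors standing in for real bi-encoder embeddings (assumption).
toy = {"doc1": [1.0, 0.0], "doc2": [0.9, 0.1], "doc3": [0.0, 1.0]}
pairs = candidate_pairs(toy, cutoff=0.2)  # hypothetical cutoff value
```

With these toy vectors only the doc1/doc2 pair survives the filter. In the actual pipeline the cutoff is tuned by the script rather than fixed by hand.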

## Alternative Estimators

1) code\_public/08\_ngrams.R
This code identifies overlapping pairs with exact text reuse. 

2) code\_public/08\_relatio\_extract\_features.py
This code extracts the relatio features. 


3) code\_public/08\_relatio\_merge.R
This code uses the extracted relatio features to measure similarity between articles.

4) code\_public/08\_stm.R
This code runs the STM model on the bioweapons docs and creates the topic clusters. 
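
As a rough illustration of what detecting exact text reuse can look like, the sketch below finds word n-grams shared by two texts. It is an assumption about the general technique, not a description of code\_public/08\_ngrams.R. The tokenization, the n-gram length, and the function names are hypothetical.

```python
def word_ngrams(text, n=5):
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(text_a, text_b, n=5):
    """Return the n-grams that appear verbatim in both texts."""
    return word_ngrams(text_a, n) & word_ngrams(text_b, n)


# Two invented sentences that share a verbatim run of words.
a = "the lab was funded by a foreign government according to officials"
b = "reports say the lab was funded by a foreign government last year"
overlap = shared_ngrams(a, b, n=5)
```

A pair of documents could then be flagged as overlapping when the set of shared n-grams is non-empty, or when it exceeds some threshold chosen by the analyst.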

## Analysis

1) code\_public/09\_pull\_together\_data.R
This code pulls together the final data for the gpt4o annotations. It also creates gpt4o precision sets. 

2) Recall Analysis: code\_public/09\_estimate\_recall.R 

3) Precision Analysis: code\_public/09\_estimate\_precision.R
This code combines the precision estimates and creates the overall f1 estimates for the study. It also creates the fine-tuning training data.

4) Case Study: code\_public/10\_analysis.R
This code includes all of the bioweapons case study
analysis. 
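
For reference, the f1 score is the harmonic mean of precision and recall. The sketch below shows only that standard formula. How the scripts estimate precision and recall themselves is not shown here, and the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example values, not results from the study.
example = f1_score(precision=1.0, recall=0.5)
```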

## Example Slurm Scripts

Example slurm scripts are included in the folder "example\_slurms." Replicators can match the slurm script to the code file by the name of the slurm script. Slurm scripts are included for code scripts which need to be run as array jobs, as indicated in the header of the file. Replicators should tailor the slurm scripts to the HPC systems they have access to. 
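
For replicators unfamiliar with array jobs, the sketch below shows one common way a Python script can pick its share of the work from the environment variables Slurm sets for each array task (SLURM\_ARRAY\_TASK\_ID and SLURM\_ARRAY\_TASK\_COUNT). It is a generic pattern, not code from this package. It assumes task ids start at 0 (for example `--array=0-9`), and the function name and item count are hypothetical.

```python
import os


def task_slice(n_items, task_id, n_tasks):
    """Return the (start, stop) index range handled by one array task.

    Items are split into n_tasks contiguous chunks whose sizes differ
    by at most one. Assumes 0-based task ids.
    """
    base, extra = divmod(n_items, n_tasks)
    start = task_id * base + min(task_id, extra)
    stop = start + base + (1 if task_id < extra else 0)
    return start, stop


# Fall back to a single task when run outside Slurm.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

# 1000 is a placeholder for the number of items (e.g. document pairs).
start, stop = task_slice(n_items=1000, task_id=task_id, n_tasks=n_tasks)
```

Each task would then process only items `start` through `stop - 1`. The header of each script and the matching example slurm file remain the authoritative guide to how the array index is actually used.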
